perf: optimize qwen3.5 hybrid linear cache flow [4/N] #1160
JC-ut0 wants to merge 2 commits into jd-opensource:main
Conversation
Code Review
This pull request introduces support for hybrid attention models (such as qwen3_next) by differentiating between full attention and linear (GDN) attention layers during KV cache estimation and allocation. Key changes include updating LLMEngine and RecEngine to calculate cache capacity based on specific layer types, adding logic to AclGraph to correctly identify valid KV caches in mixed-layer models, and refactoring WorkerImpl to selectively allocate specific cache tensors (conv/ssm vs. key/value) per layer. Review feedback highlights the need for consistency across the engine: the centralized is_full_attention_layer helper should be used so that default attention intervals are handled in one place and potential division-by-zero errors are avoided.
```cpp
            torch::dtype(dtype_).device(device_)),
        2);
  }
#elif defined(USE_ILU) || defined(USE_MLU) || defined(USE_MUSA)
```
The #elif defined(USE_ILU) || defined(USE_MLU) || defined(USE_MUSA) and #else branches appear to have identical behavior — what's the reason for splitting them?
This follows the original code style below.
Removed unused layer types variable from worker_impl.cpp
Add logic to AclGraph to correctly identify valid KV caches in mixed-layer models, and refactor WorkerImpl to selectively allocate specific cache tensors (conv/ssm vs. key/value) per layer.